Towards scaling up induction of second-order decision tables
Author
Abstract
One of the fundamental challenges for data mining is to enable inductive learning algorithms to operate on very large databases. Ensemble learning techniques such as bagging have been applied successfully to improve the accuracy of classification models by generating multiple models from replicate training sets and aggregating them to form a composite model. In this paper, we adapt the bagging approach for scaling up and also study the effects of data partitioning, sampling, and aggregation techniques for mining very large databases. Our recent work developed SORCER, a learning system that induces a near-minimal rule set from a data set represented as a second-order decision table (a database relation in which rows have sets of atomic values as components). Despite its simplicity, experiments show that SORCER is competitive with other state-of-the-art induction systems. Here we apply SORCER using two instance subset selection procedures (random partitioning and sampling with replacement) and two aggregation procedures (majority voting and selecting the model that performs best on a validation set). We experiment with the GIS data set from the UCI KDD Repository, which contains 581,012 instances of 30x30 meter cells with 54 attributes for classifying forest cover types. Performance results are reported, including results from mining the entire training data set using different compression algorithms in SORCER and published results from neural net and decision tree learners.
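The two subset selection procedures and two aggregation procedures named in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: SORCER is not available here, so `train_one_rule` is a hypothetical stand-in learner that maps the value of the first attribute to its most common class, and every function name is an assumption made for this example.

```python
import random
from collections import Counter


def random_partition(data, k, rng):
    """Shuffle the data and split it into k disjoint subsets."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def bootstrap_samples(data, k, size, rng):
    """Draw k samples of the given size, each sampled with replacement."""
    return [[rng.choice(data) for _ in range(size)] for _ in range(k)]


def train_one_rule(subset):
    """Hypothetical stand-in for the base learner (NOT SORCER).

    Maps each value of attribute 0 to the most common class seen with it,
    falling back to the subset's overall majority class for unseen values.
    """
    by_value = {}
    for features, label in subset:
        by_value.setdefault(features[0], Counter())[label] += 1
    default = Counter(label for _, label in subset).most_common(1)[0][0]
    table = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
    return lambda features: table.get(features[0], default)


def accuracy(predict, data):
    """Fraction of (features, label) pairs that predict classifies correctly."""
    return sum(predict(f) == y for f, y in data) / len(data)


def majority_vote(models, features):
    """Aggregation 1: return the class predicted by the most models."""
    votes = Counter(m(features) for m in models)
    return votes.most_common(1)[0][0]


def best_on_validation(models, validation):
    """Aggregation 2: return the single model with the highest validation accuracy."""
    return max(models, key=lambda m: accuracy(m, validation))
```

To use the sketch, train one model per subset, for example `models = [train_one_rule(s) for s in random_partition(train, 10, random.Random(0))]`, and then either classify each instance with `majority_vote(models, features)` or keep only `best_on_validation(models, validation)`. Random partitioning yields disjoint subsets that together cover the training set, whereas sampling with replacement may repeat or omit instances.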
Similar resources
Scaling production and improving efficiency in DEA: an interactive approach
DEA models help a DMU to detect its (in-)efficiency and to improve activities, if necessary. Efficiency is only one economic aim for a decision-maker; however, up- or downsizing might be a second one. Improving efficiency is the main topic in DEA; the long-term strategy towards the right production size should attract our attention as well. Not always the management of a DMU primarily focuses o...
The Power of Second-Order Decision Tables
The success of data mining techniques can be measured by the usefulness of models they produce. Often these models must be explainable as well as accurate. While decision tables are easy to interpret and explain to virtually all users, there has been little study of whether such simple models are powerful enough to use for data mining. This paper presents SORCER, a learning system that induces ...
Universal Access to Surgical Care and Sustainable Development in Sub-Saharan Africa: A Case for Surgical Systems Research; Comment on “Global Surgery – Informing National Strategies for Scaling Up Surgery in Sub-Saharan Africa”
National level experiences, lessons learnt from the Millennium Development Goal (MDG) era coupled with the academic evidence and proposals generated by the Lancet Commission on Global Surgery (LCoGS) together with the economic arguments and recommendations from the World Bank Group’s “Essential Surgery” Disease Control Priorities (DCP3) publication, provided the impetus for political commitment...
Compression-Based Induction and Genome Data
Our previous work developed SORCER, a learning system that induces a set of rules from a data set represented as a second-order decision table. Second-order decision tables are database relations in which rows have sets of atomic values as components. Using sets of values, which are interpreted as disjunctions, provides compact representations that facilitate efficient management and enhance co...
Parallel Rule Induction with Information Theoretic Pre-Pruning
In a world where data is captured on a large scale, the major challenge for data mining algorithms is to be able to scale up to large datasets. There are two main approaches to inducing classification rules: one is the divide and conquer approach, also known as the top down induction of decision trees; the other approach is called the separate and conquer approach. A considerable amount of work ...